refactor: Migrate Content Understanding from preview to GA and consolidate AI Services account #575

Merged: Harsh-Microsoft merged 9 commits into dev from psl-hb-us-41641 on May 8, 2026

Harsh-Microsoft (Contributor) commented May 6, 2026

Summary

Migrate Azure AI Content Understanding from preview 2024-12-01-preview to GA 2025-11-01, and consolidate the standalone Content Understanding Cognitive Services account into the existing unified Azure AI Services account (which now hosts both Azure OpenAI and Content Understanding).

Why two changes in one PR

In preview, CU was available in only 3 regions, which forced us to provision it as a separate Microsoft.CognitiveServices/accounts resource (aicu-*) alongside the OpenAI account (aif-*). With GA, CU's region footprint expanded enough that a single unified account can host both services in 11 common regions, so it makes sense to ship the migration and the consolidation together.

Infra changes

  • Drop avmAiServices_cu module, contentUnderstandingPrivateEndpoint, and the contentUnderstandingLocation parameter from main.bicep and main_custom.bicep; mirror in main.json.
  • Restrict azureAiServiceLocation @allowed to the 11-region intersection where both CU GA and gpt-5.1 GlobalStandard are available.
  • Add two Cognitive Services User role assignments (API and Workflow managed identities) on the unified account so CU calls don't 403.
  • Re-route APP_CONTENT_UNDERSTANDING_ENDPOINT to the unified account (CU lives at the same cognitiveservices.azure.com host, only the path differs).
  • Drop AZURE_ENV_CU_LOCATION mapping from main.parameters.json and main.waf.parameters.json.
  • Remove contentUnderstandingLocation override from .github/workflows/deploy.yml.

Application code changes

  • Bump api-version to 2025-11-01 and switch to the GA REST surface in src/ContentProcessor/src/libs/azure_helper/content_understanding.py: :analyzeBinary for stream payloads, knowledgeSources[] for training data, and /files/{id} for figure retrieval.
  • Update Pydantic models for GA: add Warning, relax Page optionals (angle/spans/words/lines), and surface the new top-level DocumentContent.paragraphs field.
  • Add unit tests for the new Warning model and relaxed Page optionals; bump existing apiVersion fixtures.
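
The two GA URL shapes touched by this change can be sketched as small helpers. This is a minimal illustration, not the code in content_understanding.py: the helper names are hypothetical, and the `/contentunderstanding/analyzers/` path prefix is an assumption. Only the `:analyzeBinary` action, the `/files/{id}` suffix, and the `2025-11-01` version come from this PR.

```python
from urllib.parse import quote

API_VERSION = "2025-11-01"  # GA api-version named in this PR


def analyze_binary_url(endpoint: str, analyzer_id: str,
                       api_version: str = API_VERSION) -> str:
    """Build the URL used to post a raw byte stream to an analyzer.

    ASSUMPTION: the '/contentunderstanding/analyzers/' prefix is illustrative;
    the PR itself only names the ':analyzeBinary' action.
    """
    base = endpoint.rstrip("/")
    return (
        f"{base}/contentunderstanding/analyzers/{quote(analyzer_id, safe='')}"
        f":analyzeBinary?api-version={api_version}"
    )


def generated_file_url(operation_location: str, file_id: str,
                       api_version: str = API_VERSION) -> str:
    """Build the GA file-retrieval URL from an Operation-Location value."""
    # Drop whatever query string the service appended before adding the suffix.
    base = operation_location.split("?", 1)[0].rstrip("/")
    # file_id may be an id or a path, so it is passed through unescaped.
    return f"{base}/files/{file_id}?api-version={api_version}"
```

Keeping URL construction in pure functions like these makes the api-version bump testable without any network access.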

Docs

  • CustomizingAzdParameters.md: drop AZURE_ENV_CU_LOCATION row, rewrite AZURE_ENV_AI_SERVICE_LOCATION row, and append a usageName note for the Standard deployment type.
  • LocalDevelopmentSetup.md: replace stale aicu-{suffix} reference with aif-{suffix}.
  • TroubleShootingSteps.md: update the CU 403 row for the consolidated account name and DNS zones.

Validation

  • ruff check passes on touched files.
  • pytest tests/unit/azure_helper/test_content_understanding_model.py — 15/15 passed.
  • az bicep build clean on both main.bicep and main_custom.bicep.
  • End-to-end test on the GA-deployed environment confirms functional parity with the preview baseline (same fields populated, same fields missed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ervices account

Migrate Azure AI Content Understanding from 2024-12-01-preview to GA
2025-11-01 (ADO 41641), and consolidate the standalone Content
Understanding Cognitive Services account into the existing unified
Azure AI Services account (now hosting both Azure OpenAI and CU).

Infra
- Drop avmAiServices_cu module, contentUnderstandingPrivateEndpoint,
  and the contentUnderstandingLocation parameter from main.bicep and
  main_custom.bicep; mirror the changes in main.json.
- Restrict azureAiServiceLocation @allowed to the 11-region
  intersection where both CU GA and gpt-5.1 GlobalStandard are
  available.
- Add two Cognitive Services User role assignments (API and Workflow
  managed identities) on the unified account so CU calls don't 403.
- Re-route APP_CONTENT_UNDERSTANDING_ENDPOINT to the unified account.
- Drop AZURE_ENV_CU_LOCATION mapping from main.parameters.json and
  main.waf.parameters.json.
- Remove contentUnderstandingLocation override from
  .github/workflows/deploy.yml.

Application code
- Bump api-version to 2025-11-01 and switch to the GA REST surface:
  :analyzeBinary for stream payloads, knowledgeSources[] for training
  data, and /files/{id} for figure retrieval.
- Update Pydantic models for GA: add Warning, relax Page optionals
  (angle/spans/words/lines), and surface the new top-level
  DocumentContent.paragraphs field.
- Add unit tests for the new Warning model and relaxed Page
  optionals; bump existing apiVersion fixtures.

Docs
- CustomizingAzdParameters.md: drop AZURE_ENV_CU_LOCATION row, rewrite
  AZURE_ENV_AI_SERVICE_LOCATION row, and append a usageName note for
  the Standard deployment type.
- LocalDevelopmentSetup.md: replace stale aicu-{suffix} reference.
- TroubleShootingSteps.md: update the CU 403 row for the consolidated
  account name and DNS zones.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot commented May 6, 2026

Coverage

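Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| libs/azure_helper/model/content_understanding.py | 92 | 2 | 97% | 71, 94 |
| TOTAL | 1217 | 161 | 86% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 244 | 0 💤 | 0 ❌ | 0 🔥 | 4.788s ⏱️ |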

Harsh-Microsoft changed the title from "Migrate Content Understanding from preview to GA and consolidate AI Services account" to "refactor: Migrate Content Understanding from preview to GA and consolidate AI Services account" on May 6, 2026
Replace /docs/re-use-*.md with relative paths so the lychee link
checker resolves them. Pre-existing links flagged on this PR because
the file was modified by the GA migration commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR migrates the Content Processor’s Azure AI Content Understanding integration from the preview API to the GA API version and updates the infrastructure to consolidate Content Understanding into the existing unified Azure AI Services account (hosting both Azure OpenAI and Content Understanding).

Changes:

  • Update the Content Understanding REST client + Pydantic response models to the GA API surface (2025-11-01) and new endpoints/actions.
  • Refactor infra parameters/templates to remove the standalone CU account + location override and route CU endpoint to the unified AI Services account, including required RBAC assignments.
  • Update tests and docs to reflect GA model shape changes and the consolidated account/private networking guidance.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
|---|---|
| src/ContentProcessor/src/libs/azure_helper/content_understanding.py | Switch to GA API version and new :analyzeBinary, knowledgeSources, and /files/{id} behaviors. |
| src/ContentProcessor/src/libs/azure_helper/model/content_understanding.py | Update response models for GA (structured warnings, relaxed Page fields, add top-level paragraphs). |
| src/ContentProcessor/tests/unit/azure_helper/test_content_understanding_model.py | Update unit tests for GA apiVersion and structured warnings / optional Page fields. |
| src/tests/ContentProcessor/azure_helper/test_content_understanding_model.py | Mirror GA model/unit test updates in the parallel test suite. |
| infra/main.bicep | Remove CU-specific account/location + private endpoint, restrict AI Services regions, add RBAC on unified account, reroute CU endpoint. |
| infra/main_custom.bicep | Same as main.bicep for custom deployment path. |
| infra/main.json | Regenerated ARM template reflecting consolidation/removals and updated parameters/role assignments. |
| infra/main.parameters.json | Remove contentUnderstandingLocation parameter mapping. |
| infra/main.waf.parameters.json | Remove contentUnderstandingLocation parameter mapping for WAF deployments. |
| .github/workflows/deploy.yml | Stop passing contentUnderstandingLocation override during deployments. |
| docs/CustomizingAzdParameters.md | Remove AZURE_ENV_CU_LOCATION, update AI Services location guidance and Standard/GlobalStandard note. |
| docs/LocalDevelopmentSetup.md | Update example CU endpoint host to aif-{suffix} (unified account). |
| docs/TroubleShootingSteps.md | Update CU 403 troubleshooting guidance for unified account and DNS/private endpoint expectations. |
Comments suppressed due to low confidence (1)

src/ContentProcessor/src/libs/azure_helper/content_understanding.py:327

  • get_image_from_analyze_operation() is now documented as retrieving a generic generated file via /files/{id}, but it still (a) uses the parameter name image_id and (b) asserts Content-Type == "image/jpeg". That assertion can fail for non-JPEG file types (or if the service changes MIME types) and would raise unexpectedly in production. Either narrow the method back to images-only (and document that guarantee) or relax the check / return the content without asserting a specific MIME type.
    def get_image_from_analyze_operation(
        self, analyze_response: Response, image_id: str
    ):
        """Retrieves a generated file (e.g., a rendered page image) from a
        completed analyze operation by its file id / path.

        In Content Understanding GA the file-retrieval URL changed from
        ``{operationLocation}/images/{imageId}`` to
        ``{operationLocation}/files/{fileId}`` (where ``operationLocation`` now
        ends in ``/analyzerResults/{operationId}``).

        Args:
            analyze_response (Response): The response object from the analyze operation.
            image_id (str): The id (or path) of the file to retrieve.
        Returns:
            bytes: The file content as a byte string.
        """
        operation_location = analyze_response.headers.get("operation-location", "")
        if not operation_location:
            raise ValueError(
                "Operation location not found in the analyzer response header."
            )
        operation_location = operation_location.split("?api-version")[0]
        image_retrieval_url = (
            f"{operation_location}/files/{image_id}?api-version={self._api_version}"
        )
        try:
            response = requests.get(url=image_retrieval_url, headers=self._headers)
            response.raise_for_status()

            assert response.headers.get("Content-Type") == "image/jpeg"

            return response.content

Comment thread src/ContentProcessor/src/libs/azure_helper/content_understanding.py
The existing Foundry must support both gpt-5.1 (GlobalStandard) and Content Understanding GA, otherwise deployment will fail with downstream model/analyzer errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 06:53
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.

Comment thread src/ContentProcessor/src/libs/azure_helper/model/content_understanding.py Outdated
Comment thread src/ContentProcessor/src/libs/azure_helper/model/content_understanding.py Outdated
Comment thread docs/re-use-foundry-project.md Outdated
Deploy may fail at provisioning time for unsupported gpt-5.1 region, or appear to succeed but break at runtime when Content Understanding GA is unavailable in the existing Foundry's region.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oject.md

Lychee CI rejects root-relative paths when no base dir is configured. Switch to relative ./DeploymentGuide.md paths matching the fix already applied to CustomizingAzdParameters.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 07:15
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/ContentProcessor/src/libs/azure_helper/content_understanding.py:327

  • get_image_from_analyze_operation() now targets the GA /files/{id} endpoint and the docstring describes retrieving a generic generated file, but the code still asserts Content-Type == image/jpeg. If the service returns a different media type (or omits the header), this will raise AssertionError and bypass the RequestException handler. Consider removing the assert or validating/handling content types more robustly (and raising a normal exception on unexpected types).
        try:
            response = requests.get(url=image_retrieval_url, headers=self._headers)
            response.raise_for_status()

            assert response.headers.get("Content-Type") == "image/jpeg"

            return response.content
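
One way to act on this suggestion is to replace the bare assert with an explicit check that raises an ordinary exception. The sketch below is an assumption about how that could look, not the fix applied in this PR: `UnexpectedContentTypeError` and `ensure_image_content_type` are hypothetical names, and the function takes a plain header mapping so it runs without the requests library.

```python
from typing import Mapping


class UnexpectedContentTypeError(ValueError):
    """Hypothetical error raised when a download has the wrong media type."""


def ensure_image_content_type(headers: Mapping[str, str],
                              expected_prefix: str = "image/") -> str:
    """Return the normalized media type, or raise if it is not an image.

    Unlike `assert`, this check is not stripped when Python runs with -O,
    and callers can catch the error like any other exception.
    """
    raw = headers.get("Content-Type") or ""
    # Ignore parameters such as '; charset=...' and normalize the case.
    media_type = raw.split(";", 1)[0].strip().lower()
    if not media_type.startswith(expected_prefix):
        raise UnexpectedContentTypeError(
            f"Expected a '{expected_prefix}*' response, got {raw!r}"
        )
    return media_type
```

A caller would run this on `response.headers` after `raise_for_status()` and before returning `response.content`; passing a different `expected_prefix` relaxes or tightens the check per call site.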

Comment thread docs/re-use-foundry-project.md Outdated
Harsh-Microsoft and others added 2 commits May 7, 2026 13:21
Align the listed regions with the @allowed list enforced by
infra/main.bicep and infra/main_custom.bicep for the unified AI
Services account (japaneast, southeastasia, uksouth added;
northcentralus, switzerlandnorth, westus2 removed).
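
A drift like this between the docs and the Bicep @allowed list can be caught with a plain set comparison. The sketch below is illustrative only: `region_drift` is a hypothetical helper, and the sample lists contain just the six regions named in this commit message, since the regions shared by both lists are not enumerated here.

```python
def region_drift(documented, allowed):
    """Compare documented regions against the enforced allow-list.

    Returns (missing_from_docs, stale_in_docs), each sorted for stable output.
    """
    documented, allowed = set(documented), set(allowed)
    return sorted(allowed - documented), sorted(documented - allowed)


# Truncated sample data: only the regions this commit names as added/removed.
documented = ["northcentralus", "switzerlandnorth", "westus2"]
allowed = ["japaneast", "southeastasia", "uksouth"]

missing_from_docs, stale_in_docs = region_drift(documented, allowed)
print("add to docs:", missing_from_docs)
print("remove from docs:", stale_in_docs)
```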

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace Optional[...] = None and mutable [] defaults on Page.spans,
Page.words, DocumentContent.paragraphs, and ResultData.warnings with
Field(default_factory=list). Removes a possible TypeError when the
confidence evaluator iterates page.words and aligns with the repo's
Pydantic convention. Tests updated to assert empty-list defaults.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/ContentProcessor/src/libs/azure_helper/content_understanding.py:326

  • get_image_from_analyze_operation() now describes retrieving a generic generated file via the GA /files/{id} route, but it still asserts Content-Type == 'image/jpeg'. This will raise an AssertionError if the service returns a different image type (or any non-JPEG file), even when the download is successful. Consider removing the strict assert or relaxing it (e.g., validate an image/ prefix only when the caller expects an image).
            response = requests.get(url=image_retrieval_url, headers=self._headers)
            response.raise_for_status()

            assert response.headers.get("Content-Type") == "image/jpeg"

Comment thread src/ContentProcessor/src/libs/azure_helper/model/content_understanding.py Outdated
Pre-existing Optional[List[...]] = [] defaults silently parsed an
explicit JSON null to None, which would cause a TypeError when the
confidence evaluator iterates page.lines (line 132). Switch to the
same Field(default_factory=list) pattern used for the other Page
collections so these fields fail loudly at parse time on a malformed
response and remain safe to iterate at every call site.
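
The behavioral difference described above can be reproduced with two toy models. This is a simplified sketch, not the repo's models: `LoosePage` and `StrictPage` are hypothetical names, and `lines` is typed as a list of strings purely for brevity.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class LoosePage(BaseModel):
    # Pre-fix shape: Optional lets an explicit JSON null through as None.
    lines: Optional[List[str]] = []


class StrictPage(BaseModel):
    # Post-fix shape: a missing field becomes a fresh empty list,
    # while an explicit null is rejected during validation.
    lines: List[str] = Field(default_factory=list)


loose = LoosePage(**{"lines": None})  # parses, but lines is None
absent = StrictPage()                 # field missing from the payload

try:
    StrictPage(**{"lines": None})
    null_rejected = False
except ValidationError:
    null_rejected = True

# Iterating absent.lines is always safe; iterating loose.lines would raise
# TypeError because None is not iterable.
for _line in absent.lines:
    pass
```

With the strict shape, a malformed response surfaces as a ValidationError at the parsing boundary instead of a TypeError deep inside a consumer such as the confidence evaluator.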

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Harsh-Microsoft merged commit 1fbb362 into dev on May 8, 2026
35 of 36 checks passed
3 participants